Abstract

This paper describes a novel approach for handling translation divergences in a Generation-Heavy Hybrid Machine Translation (GHMT) system. The approach depends on the existence of rich target language resources such as word lexical semantics, including information about categorial variations and subcategorization frames. These resources are used to generate multiple structural variations from a target-glossed lexico-syntactic representation of the source language sentence. The multiple structural variations account for different translation divergences. The overgeneration of the approach is constrained by a target-language model using corpus-based statistics. The exploitation of target language resources (symbolic and statistical) to handle a problem usually reserved for Transfer and Interlingual MT is useful for translation from structurally divergent source languages with scarce linguistic resources. A preliminary evaluation of the approach on Spanish-English MT shows it to be extremely promising. The approach, however, is not limited to MT, as it can be extended to monolingual NLG applications such as summarization.

1  Introduction 

We present a Generation-Heavy Machine Translation (GHMT) system that takes an asymmetrical hybrid approach to Machine Translation: our generation component constrains the translation using a combination of symbolic rules, lexicons, and corpus-based statistics. Source languages are only expected to have a syntactic parser and a translation lexicon that maps source words to target bags of words. No transfer rules or complex interlingual representations are required. The approach depends on the existence of rich target language resources such as word lexical semantics, including information about categorial variations and subcategorization frames. These resources are used to generate multiple structural variations from a target-glossed lexico-syntactic representation of the source language sentence. The multiple structural variations account for different translation divergences. The overgeneration of the approach is constrained by a target-language model using corpus-based statistics. The exploitation of target-language resources (symbolic and statistical) to handle a problem usually reserved for Transfer and Interlingual MT is useful for translation from structurally divergent source languages with scarce linguistic resources. A preliminary evaluation of the approach on Spanish-English MT shows it to be extremely promising. The approach, however, is not limited to MT, as it can be extended to monolingual NLG applications such as summarization.

The work presented here focuses on the generation component of this system and its handling of translation divergences. The next section describes the range of divergence types covered in this work and discusses previous approaches to handling them in MT. Sections (3) and (4) introduce our approach and describe the different components and algorithms in the translation system. Finally, Section (5) describes a preliminary evaluation we undertook to assess the applicability of this approach on a large scale to Spanish-English MT.

2  Background 

A translation divergence occurs when the underlying concept or “gist” of a sentence is distributed over different words for different languages. For example, the notion of floating across a river is expressed as float across a river in English and cross a river floating (atravesó el río flotando) in Spanish (Dorr, 1993b). An investigation by Dorr et al. (2002) found that divergences occurred in approximately 1 out of every 3 sentences in a sample of 19K sentences from the TREC El Norte Newspaper Corpus. This analysis was done on the TREC Spanish Data^ using automatic detection techniques followed by human confirmation. We describe each divergence type before turning to alternative approaches to handling them in MT.

2.1  Translation  Divergences 

While there are many ways to classify divergences, we present them here in terms of five specific divergence types that can take place alone or in combination with other types of translation divergences. Table (1) presents these divergence archetypes with Spanish-English examples. The last column displays the percentage of occurrence of each divergence type, taken from the first 48 verb-unique instances of Spanish-English divergences from the TREC El Norte corpus. Note that these numbers do not reflect the percentage of occurrence of the divergence type in the corpus as a whole, but rather its percentage of occurrence in the first 48 divergent sentences. There is often overlap among the divergence types (e.g., categorial divergence occurs almost every time there is any other type of divergence).

2.1.1  Categorial  Divergence 

Categorial divergence involves a translation that uses different parts of speech. This is by far the most common divergence type. In the Spanish-English example below, the light verb and noun phrase are translated as another light verb and an adjectival form of the noun.

(1) tener celos (to have jealousy) ⇔ to be jealous
    tener plena conciencia (have full awareness) ⇔ to be fully aware

2.1.2  Conflation 

Conflation involves the translation of two words using a single word that combines the meaning of the two. In Spanish-English translation, this divergence type has two forms: light verb conflation and manner conflation. Light verb conflation involves a single verb in one language being translated using a combination of a semantically “light” verb, i.e., one that carries little or no specific meaning in its own right, and some other meaning unit (perhaps a noun) to convey the appropriate meaning. English light verbs include give, make, do, take, and have.

(2) dar una patada (give a kick) ⇔ to kick
    poner fin (put end) ⇔ to end
    tomar nota (take note) ⇔ to note

Manner conflation involves translating a single manner verb (e.g., float) as a light verb of motion and a manner-indicating content word that is typically a progressive manner verb in Spanish.

(3) to float ⇔ ir flotando (go (via) floating)
    to pass ⇔ ir pasando (go passing)

^LDC catalog no. LDC2000T51, ISBN 1-58563-177-9, 2000.
2.1.3  Structural  Divergence 

A structural divergence involves the realization of incorporated arguments such as subject and object as obliques (i.e., headed by a preposition in a PP) or vice versa.

(4) entrar en la casa (enter in the house) ⇔ to enter the house
    pedir un referendum (ask-for a referendum) ⇔ ask for a referendum

2.1.4  Head  Swapping 

This divergence involves the demotion of the head verb and the promotion of one of its modifiers to head position. In other words, a permutation of semantically equivalent words is necessary to go from one language to the other. In Spanish, this divergence is typical in the translation of an English motion verb and a preposition as a directed motion verb and a progressive verb.

(5) entrar corriendo (enter running) ⇔ to run in
    andar volando (go-about flying) ⇔ to fly about

2.1.5  Thematic  Divergence 

A thematic divergence occurs when the verb’s arguments switch thematic roles from one language to another. The Spanish verbs gustar and doler are examples of this case.
Divergence      Spanish (literal gloss)                     English            Occurrence
Categorial      X tener hambre (X have hunger)              X be hungry        98%
Conflational    X dar puñaladas a Z (X give stabs to Z)     X stab Z           83%
Structural      X entrar en Y (X enter in Y)                X enter Y          35%
Head Swapping   X cruzar Y nadando (X cross Y swimming)     X swim across Y    8%
Thematic        X gustar a Y (X please to Y)                Y like X           6%

Table 1: Translation Divergence Types


(6) Me gustan uvas (to-me please grapes) ⇔ I like grapes
    me duele la cabeza (to-me hurt the head) ⇔ I have a headache

2.2  Handling  Translation  Divergences 

Since translation divergences require a combination of lexical and structural manipulation, they are traditionally handled, at a minimum, at the transfer level of the MT hierarchy. A pure transfer approach is a brute-force attempt to encode all translation divergences in a transfer lexicon (Dorr et al., 1999). However, more sophisticated techniques have been developed that use lexical semantic knowledge to detect and handle these phenomena. An interlingual approach, proposed by Dorr (1993b; 1994), uses Jackendoff’s Lexical Conceptual Structure (LCS) (Jackendoff, 1972; Jackendoff, 1976; Jackendoff, 1983; Jackendoff, 1990) as an interlingua. LCS is a compositional abstraction with language-independent properties that transcend structural idiosyncrasies. This representation has been used as the interlingua of several projects such as UNITRAN (Dorr, 1993a) and MILT (Dorr, 1997). LCS provides a granularity of representation much finer than syntactic representation and is much more language-independent. As an example, the Spanish sentence atravesó el río flotando can be “composed” into the following LCS using a Spanish LCS lexicon as part of an interlingual analysis step.

(7) [event CAUSE JOHN
      [event GO JOHN
        [path ACROSS JOHN
          [position AT JOHN RIVER]]]
      [manner SWIM+INGLY]]

In the generation phase, that same LCS is “decomposed” using LCS English lexicon entries to yield John swam across the river. A detailed discussion of generation from LCS is available in (Traum and Habash, 2000).

An alternative approach using lexico-structural transfer enriched with lexical semantic features was proposed by Nasr et al. (1997). In this lexicalized grammar approach, a unified syntactic and semantic representation is used for each lexical item, which includes appropriate cross-linguistic semantic features. Transfer lexicon rules are written so as to capture generalizations across the language pair. The transfer is done at the Deep Syntactic Structure (DSyntS) level of Mel’čuk’s Meaning Text Theory (Mel’čuk, 1988). The approach also uses Lexical Functions (also from Meaning Text Theory) to handle analysis and generation. The following transfer rule can be used to handle the head swapping divergence discussed in the last example:

(8) @TRANS_CORR
    @EN V1 [cat:verb manner:M]
        (ATTR Y [cat:prep path:P event:go]
            (II N))
    @SP V2 [cat:verb path:P event:go]
        (II N
         ATTR Z [manner:M])

Here, a transfer correspondence is established between the different components of two DSyntS templates, one for English and one for Spanish. Note how the manner variable M and the path variable P switch dominance.

A major limitation of the interlingual and transfer approaches is that they require a large amount of explicit lexical semantic knowledge for both source and target languages.

3  Our  Approach:  Generation-Heavy 
Machine  Translation 

Our approach is closely related to the hybrid approach whose intuition was first described in the seminal work of Knight and Hatzivassiloglou (1995) and Langkilde and Knight (1998a; 1998b; 1998c). The idea is to combine symbolic and statistical knowledge in generation through a two-step process: (1) Symbolic Overgeneration followed by (2) Statistical Extraction. The hybrid approach has mainly been used for lexical choice (including morphology and tense selection) (Langkilde and Knight, 1998c; Bangalore and Rambow, 2000a) and for linearization from semantic representations (Langkilde and Knight, 1998a) or from shallow unlabeled dependencies (Bangalore and Rambow, 2000b).

What we propose here is the extension of the hybrid approach to handle translation divergences without the use of a deeper semantic representation or transfer rules. We accomplish this by extending the symbolic overgeneration component to include structural and categorial expansion of the source language lexico-structural representation. By overgenerating lexico-structural combinations preferred by the target language, we make them available as choices for ranking by the statistical extraction component. The overgeneration is constrained by linguistically motivated rules that utilize target-language lexical semantics and subcategorization frames, and it is independent of source-language preferences.

3.1  Overview  of  GHMT 

Figure (1) presents an overview of the complete MT system. The three phases of Analysis, Translation, and Generation are very similar to those of other MT paradigms (Analysis-Transfer-Generation or Analysis-Interlingua-Generation) (Dorr et al., 1999). However, these phases are not symmetrical. Analysis relies only on source-sentence parsing and is independent of the target language. The output of Analysis is a deep syntactic dependency that normalizes over syntactic phenomena such as passivization and morphological expressions of tense, number, etc. Translation converts the source-language lexemes into bags of target-language lexemes. The dependency structure of the source language is maintained. The last phase, Generation, is where most of the work is done to manipulate the input lexically and structurally to produce target sequences.

The generation component utilizes three major resources: a word-class lexicon, a categorial-variations lexicon, and a syntactic-thematic linking map.

3.1.1  Word-Class  Lexicon 

The word-class lexicon currently contains only verbs and prepositions, as these are the predicate-argument relations primarily involved in translation. Each of these categories is grouped into “classes.” In the case of verbs, there are 511 verb classes for 3,131 verbs, totaling 8,650 entries. An example is shown here:

(9) (DEFINE-WCLASS
      :NUMBER "V.13.1.a.ii"
      :NAME "Give - No Exchange"
      :SENTENCES ("He !!+ed the car to John"
                  "He !!+ed John the car")
      :POS V
      :THETA_ROLES
        (((ag obl) (th obl) (goal obl to))
         ((ag obl) (goal obl) (th obl)))
      :LCS_PRIMS (cause go possessional)
      :SPEC ((ag (animate +)))
      :WORDS (feed give pass pay peddle refund
              render repay serve))

In the case of prepositions, there are 43 preposition classes for 125 prepositions, totaling 444 entries. An example is shown here:

(10) (DEFINE-WCLASS
       :NUMBER "P.8"
       :NAME "Preposition Class P.8"
       :POS P
       :THETA_ROLES (time)
       :LCS_PRIMS (path temporal)
       :SPEC nil
       :WORDS (until to till from before back_to
               at after))

Note that these entries are only available in the system for English since it is the target language. There are no equivalent entries for the source language.
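As an illustration only, a word-class entry such as (9) could be held in memory as follows. This is a minimal Python sketch, not the system's actual storage format: the `WordClass` type, the helper names `classes_for` and `grids_for`, and the reading of `obl` as "obligatory" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# One slot of a thematic grid: (role, obligatory?, preposition or None).
# Reading "obl" in the lexicon entry as "obligatory" is an assumption.
Slot = Tuple[str, bool, Optional[str]]


@dataclass(frozen=True)
class WordClass:
    number: str
    name: str
    pos: str
    theta_grids: Tuple[Tuple[Slot, ...], ...]
    lcs_prims: Tuple[str, ...]
    words: Tuple[str, ...]


# Transcription of entry (9) into the hypothetical structure above.
GIVE = WordClass(
    number="V.13.1.a.ii",
    name="Give - No Exchange",
    pos="V",
    theta_grids=(
        (("ag", True, None), ("th", True, None), ("goal", True, "to")),
        (("ag", True, None), ("goal", True, None), ("th", True, None)),
    ),
    lcs_prims=("cause", "go", "possessional"),
    words=("feed", "give", "pass", "pay", "peddle",
           "refund", "render", "repay", "serve"),
)


def classes_for(word, lexicon, pos=None):
    """Return every class that lists `word`, optionally filtered by POS."""
    return [wc for wc in lexicon
            if word in wc.words and (pos is None or wc.pos == pos)]


def grids_for(word, lexicon):
    """Return the role sequences of all thematic grids available to `word`."""
    return [[role for role, _, _ in grid]
            for wc in classes_for(word, lexicon)
            for grid in wc.theta_grids]
```

A word that belongs to several classes would simply receive several entries from `classes_for`, which is the ambiguity that later processing has to reduce.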

3.2 Categorial-Variation Database

The Categorial-Variation Database (CatVar) is a database of words and their categorial variants. Our investigation so far shows that no such resource is available.^ Our research has therefore involved the creation of such a resource for English. The structure of this database is rather simple: it is flat, with an indexing file that is accessible through a hash table. For a given word and its optional part of speech, the lookup mechanism returns a list of lists of categorial variants of the word (including the word itself). An excerpt is shown here:

(11) (:V (hunger) :N (hunger) :AJ (hungry))
     (:V (validate) :N (validation validity)
      :AJ (valid))
     (:V (cross) :N (crossing cross)
      :P (across))

We have currently developed 28,305 CatVar clusters comprising 40,443 POS sub-clusters, totaling 46,037 words (lexemes). The database was developed using a combination of resources and algorithms including the LCS Verb and Preposition Databases (Dorr, 2001b; Dorr, 2001a), the Brown Corpus section of the Penn Treebank (Marcus et al., 1994), an English morphological analysis lexicon developed for PC-Kimmo (ENGLEX) (Antworth, 1990), and the Porter stemmer (Porter, 1980).

^The WordNet project is currently adding such links but only for nouns and verbs (Christiane Fellbaum, p.c.).
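The lookup behaviour described for CatVar (a flat list of clusters plus a hash-table index) can be sketched as below. This is a hedged illustration in Python rather than the actual implementation: the names `CLUSTERS`, `build_index`, and `lookup` are hypothetical, and the data is limited to the three clusters in the excerpt.

```python
from collections import defaultdict

# The three clusters from the excerpt; each maps a POS tag to its variants.
CLUSTERS = [
    {"V": ["hunger"], "N": ["hunger"], "AJ": ["hungry"]},
    {"V": ["validate"], "N": ["validation", "validity"], "AJ": ["valid"]},
    {"V": ["cross"], "N": ["crossing", "cross"], "P": ["across"]},
]


def build_index(clusters):
    """Hash index: word -> list of (pos, cluster id) pairs."""
    index = defaultdict(list)
    for cid, cluster in enumerate(clusters):
        for pos, words in cluster.items():
            for word in words:
                index[word].append((pos, cid))
    return index


def lookup(word, index, clusters, pos=None):
    """Return every cluster containing `word`, optionally as a given POS."""
    seen, result = set(), []
    for word_pos, cid in index.get(word, []):
        if (pos is None or word_pos == pos) and cid not in seen:
            seen.add(cid)
            result.append(clusters[cid])
    return result


INDEX = build_index(CLUSTERS)
```

With this index, asking for `hungry` yields the cluster that also contains the verb and noun `hunger`, which is the kind of link needed to relate tener hambre to be hungry.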

3.3 The Syntactic-Thematic Linking Map

This is a large matrix that was extracted from the LCS Verb Database (LVD) (Dorr, 2001b) and the LCS Preposition Database (LPD) (Dorr, 2001a). It relates syntactic “cases” to thematic roles. The cases include 125 prepositions in addition to :subj, :obj, and :obj2. These are mapped to varying subsets of the 20 different thematic roles used in our system. The total number of links is 341 pairs. An excerpt of this resource is shown below.

(12) (:subj ag instr th exp loc src goal perc
            mod-poss poss)

     (:obj2 goal src th perc ben)
     (aboard loc goal)
     (about info mod-perc perc poss time purp
            loc goal pred)
     (according_to purp)
     (across goal loc)
     (in_spite_of purp)
     (in loc mod-poss perc goal poss prop)
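One simple way to picture the linking map is as a dictionary from syntactic case to the set of thematic roles it may realize. The Python sketch below uses only rows that appear in the excerpt; the names `LINKING_MAP`, `can_link`, and `count_links` are hypothetical and the dictionary encoding is an assumption, not the format used by the system.

```python
# A fragment of the linking map, copied from the excerpt in (12).
LINKING_MAP = {
    ":subj": {"ag", "instr", "th", "exp", "loc", "src", "goal",
              "perc", "mod-poss", "poss"},
    ":obj2": {"goal", "src", "th", "perc", "ben"},
    "aboard": {"loc", "goal"},
    "across": {"goal", "loc"},
    "in_spite_of": {"purp"},
    "in": {"loc", "mod-poss", "perc", "goal", "poss", "prop"},
}


def can_link(case, role, linking_map=LINKING_MAP):
    """True if the syntactic case is allowed to carry the thematic role."""
    return role in linking_map.get(case, set())


def count_links(linking_map=LINKING_MAP):
    """Number of (case, role) pairs, the quantity reported as 341 overall."""
    return sum(len(roles) for roles in linking_map.values())
```

On this fragment `count_links()` covers only the six rows shown, so it is far smaller than the 341 pairs of the full resource.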

4  The  Generation  Component 

The input to the generation component is a deep syntactic dependency tree similar to that of Mel’čuk’s Meaning Text Theory (MTT) (Mel’čuk, 1988), but it is written in the format of the PENMAN Sentence Planning Language (SPL) (Pen, 1989). The inventories of parts of speech and roles are very small. There are 10 parts of speech (verb, preposition, noun, proper noun, adjective, adverb, determiner, conjunction, interjection, and punctuation) and only 4 roles: subject, object, indirect object (which map to I, II, and III in MTT), and modifier. All nodes in the dependency tree are expected to be ambiguous bags of lexemes. Our machine translation approach involves a lexical translation of the parse-tree nodes corresponding to words in the source-language sentence. No structural transfer is required.

4.1  Thematic  Linking 

The first step in our system is to turn the syntactic dependency input into a thematic dependency tree. The syntax-thematic linking here is achieved through the use of thematic grids associated with English (verbal) head nodes. This step is done in the generation process using the target-language resources only. Therefore, it is a loose linking algorithm that is constrained by the thematic grids of the predicates in the target language (verbs and prepositions).

Prepositions are treated as syntactic case markers that constrain the thematic roles that can be assigned to their objects. The number and nature (obligatory, optional) of the thematic roles are determined by the verb thematic grid. We treat the linking problem as a maximum flow network variant that uses linking constraints from the verbs and prepositions, in addition to applying a Thematic Hierarchy constraint^ and allowing all syntactic roles to be treated as modifiers as an option. Therefore, we are guaranteed to get a network in every case. However, the different resulting networks are ranked by a criterion that prefers obligatory thematic roles to be linked, prioritizing linkings involving arguments ahead of those involving modifiers.

Figure (3) illustrates how the correct mapping from syntax to thematic roles is done for the two sentences Mary filled the glass with water and Mary filled water in the glass. Although the second example is not correct English (albeit good Korean), the correct roles are assigned mainly because of the limitations imposed by allowable thematic assignments for the prepositions.
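The effect of the prepositional constraints on the two fill sentences can be reproduced with a small sketch. Note that it does not implement the maximum flow formulation or the Thematic Hierarchy constraint described above: it swaps in a brute-force enumeration of assignments, which is only workable for a handful of arguments. The thematic grid assumed for fill, the `with` row of the linking fragment, and the names `LINKS`, `FILL_GRID`, and `link` are all hypothetical.

```python
from itertools import permutations

# Hypothetical linking fragment: which roles each syntactic case may carry.
LINKS = {
    ":subj": {"ag", "th", "goal"},
    ":obj": {"th", "goal"},
    "with": {"th", "instr"},
    "in": {"loc", "goal"},
}

# Assumed thematic grid for "fill"; every role is taken to be obligatory.
FILL_GRID = ("ag", "goal", "th")


def link(args, grid, links=LINKS):
    """Assign each (case, lexeme) a grid role or "mod", maximizing linked roles.

    Brute force stands in for the max-flow variant; because "mod" is always
    available to every argument, some assignment always exists.
    """
    slots = list(grid) + ["mod"] * len(args)
    best, best_score = None, -1
    for roles in sorted(set(permutations(slots, len(args)))):
        allowed = all(role == "mod" or role in links.get(case, set())
                      for (case, _), role in zip(args, roles))
        score = sum(role != "mod" for role in roles)
        if allowed and score > best_score:
            best, best_score = roles, score
    return {lexeme: role for (_, lexeme), role in zip(args, best)}


WITH_WATER = [(":subj", "Mary"), (":obj", "glass"), ("with", "water")]
IN_GLASS = [(":subj", "Mary"), (":obj", "water"), ("in", "glass")]
```

Under these assumptions both argument lists receive the same role assignment (Mary as agent, glass as goal, water as theme), because `with` cannot carry a goal and `in` cannot carry a theme.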

The output of this phase is a thematic dependency in which the relations of children to parents are thematic roles (and modifiers) instead of syntactic roles. The goals of this phase are many: (1) Reduction of ambiguity: since each verb can have multiple verb class memberships (some of which have multiple thematic grids), this step reduces the verb/verb-class/grid possibilities to only those that rank highest according to the criteria described earlier. (2) Normalization: this step normalizes over structural variation and thus approaches a solution to the structural and thematic divergences on a thematic level. (3) Accurate thematic assignment, which is essential for handling structural variations.

This step looks like analysis but it is fully driven by resources and constraints from the target language. That is why it is a central step in this generation-heavy approach.

^We make an assumption here that there is a Universal Thematic Hierarchy that governs the generation of arguments. Predicates that violate the thematic hierarchy are expected to be marked as externalizing predicates in both source and target languages (Habash and Dorr, 2001).

4.2  Structural  Expansion 

This step is for exploring alternative structural configurations of the input. There are two operations that are applied here: Conflation and Head Swapping. Lexical-semantic information from the verb class lexicon (both theta grids and lexical conceptual primitives) is used to determine the conflatability and head-swappability of combinations of nodes in the trees.

4.2.1  Conflation 

For each of the arguments of a given verb in the tree, the head verb (Vhead) and argument (Arg) pair is checked for conflatability. A pair is conflatable if (1) there exists a verb Vconf that is a categorial variation of the argument, (2) Vconf and Vhead share the same main lexical conceptual primitive, and (3) Vconf can assign the same thematic roles that are assigned by Vhead except for the one that is assigned to Arg. Take the following example for the Spanish yo le di puñaladas a Juan (I gave stabs to Juan), which results in the following thematic dependency tree after linking is done:

(13) (3 \ |give|
        :ag (1 \ |I|)
        :th (4 \ |stab|)
        :goal (6 \ |juan|))

The theme |stab| has a verb categorial variation |stab| which belongs to two different verb classes, the Poison verbs (as in crucify, electrocute, etc.) and the Swat verbs (as in bite, claw, etc.). Only the first class shares the same lexical conceptual primitive as the verb |give| (CAUSE GO). Moreover, the verb |stab| requires an agent and a goal for the stabbing. Therefore, a conflated instance is created in this case:

(14) (3 \ |stab|
        :ag (1 \ |I|)
        :goal (6 \ |juan|))

If the sentence were, say, I gave the stab a name, conflation with the categorial variation of stab would have failed because stab stands in a goal relationship with its parent.

^This also applies to expanding the possible set of alternations eventually.
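The three conflatability conditions can be checked mechanically once the lexical resources are available. The following Python sketch walks through the give/stab example under stated assumptions: the tiny tables `CATVAR_VERBS` and `VERB_CLASSES` are hypothetical stand-ins for CatVar and the word-class lexicon, and the primitive and roles given for the Swat class are placeholders, since the text only says that this class does not share CAUSE GO.

```python
# Hypothetical stand-in for CatVar: noun lexeme -> verb categorial variants.
CATVAR_VERBS = {"stab": ["stab"]}

# Hypothetical stand-in for the verb-class lexicon.
VERB_CLASSES = {
    "give": [{"name": "Give", "prim": ("cause", "go"),
              "roles": {"ag", "th", "goal"}}],
    "stab": [
        {"name": "Poison", "prim": ("cause", "go"), "roles": {"ag", "goal"}},
        # Placeholder primitive and roles for the Swat class (assumed).
        {"name": "Swat", "prim": ("act",), "roles": {"ag", "th"}},
    ],
}


def conflate(head, args):
    """Return (new_head, remaining_args) for every conflatable argument.

    `args` maps thematic roles to lexemes, as in the thematic dependency tree.
    """
    results = []
    for role, arg in args.items():
        remaining = {r: a for r, a in args.items() if r != role}
        for vconf in CATVAR_VERBS.get(arg, []):            # condition (1)
            for head_cls in VERB_CLASSES.get(head, []):
                for conf_cls in VERB_CLASSES.get(vconf, []):
                    same_prim = conf_cls["prim"] == head_cls["prim"]  # (2)
                    can_assign = set(remaining) <= conf_cls["roles"]  # (3)
                    if same_prim and can_assign:
                        results.append((vconf, remaining))
    return results


GAVE_STABS = {"ag": "I", "th": "stab", "goal": "juan"}
GAVE_STAB_A_NAME = {"ag": "I", "goal": "stab", "th": "name"}
```

Applied to `GAVE_STABS`, the sketch produces the conflated tree in (14); applied to `GAVE_STAB_A_NAME`, it produces nothing, because the remaining theme role cannot be assigned by the Poison reading of stab.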

4.2.2  Head  Swapping 

Unlike Conflation, Head Swapping is restricted to head-modifier pairs. Every such pair’s swappability is determined by the following criteria: (1) there exists a verb Vconf that is a categorial variation of the modifier; (2) there is a categorial variation of Vhead that can become a child of Vconf, such as a noun, adjective, adverb, or even a preposition; and (3) all the arguments before the swapping retain their thematic roles regardless of whether they move with the swapped verb or not. For example, the German ich esse gern (I eat likingly) results in the following thematic dependency tree after linking is done:

(15) (3 \ |eat|
        :th (1 \ |I|)
        :mod (6 \ |like|))

Here the modifier |like| and the main verb |eat| can be swapped to produce I like eating or I like to eat. If the demoted verb can become a preposition, the swapping is more complicated since prepositions are not part of the thematic dependency. For example, the Spanish Juan cruzó el río nadando (Juan crossed the river swimming) results in the following thematic dependency tree after linking is done:

(16) (3 \ |cross|
        :th (1 \ |Juan|)
        :loc (4 \ |river|)
        :mod (6 \ |swim|))

The modifier |swim| is itself a verb, and the main verb |cross| has a prepositional categorial variation |across| which can assign the thematic role :loc to |river|:

(17) (3 \ |swim|
        :th (1 \ |Juan|)
        :mod (4 \ |river| :prep |across|))
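The prepositional case of head swapping, going from (16) to (17), can be sketched as follows. This Python fragment is an illustration under assumptions and covers only the case in which the demoted head becomes a preposition: the dictionary-based tree encoding, the tables `CATVAR` and `PREP_ROLES`, and the function `swap_head_to_prep` are hypothetical names introduced for the example.

```python
# Hypothetical CatVar fragment: lexeme -> POS -> categorial variants.
CATVAR = {
    "swim": {"V": ["swim"]},
    "cross": {"V": ["cross"], "N": ["crossing", "cross"], "P": ["across"]},
}

# Roles that a preposition may assign (the "across" row of the linking map).
PREP_ROLES = {"across": {"goal", "loc"}}


def swap_head_to_prep(tree):
    """Promote a verbal modifier to head and demote the old head to a preposition.

    `tree` has the keys "head", "args" (role -> lexeme), and "mod" (lexeme).
    Returns one new tree per admissible swap.
    """
    swapped = []
    new_heads = CATVAR.get(tree["mod"], {}).get("V", [])    # criterion (1)
    preps = CATVAR.get(tree["head"], {}).get("P", [])       # criterion (2)
    for verb in new_heads:
        for prep in preps:
            for role, arg in tree["args"].items():
                # The preposition takes over the argument whose role it can assign.
                if role in PREP_ROLES.get(prep, set()):
                    kept = {r: a for r, a in tree["args"].items() if r != role}
                    swapped.append({"head": verb, "args": kept,
                                    "mod": (arg, prep)})
    return swapped


CROSS_SWIMMING = {"head": "cross",
                  "args": {"th": "Juan", "loc": "river"},
                  "mod": "swim"}
```

For `CROSS_SWIMMING` the only admissible swap keeps Juan as theme of swim and attaches river as a modifier introduced by across, mirroring (17).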

4.3  Syntactic  Assignment 

In this step, the thematic dependency is turned into a full target syntactic dependency. Syntactic positions are assigned to thematic roles using the verb class subcategorization frames. Different alternations associated with a single class are also generated here, which allows for a wider range of expression that is specific to the target language. Class category specifications are enforced by picking appropriate categorial variations of the different arguments. For example, the main verb for the Spanish tengo hambre (I have hunger) translates into (have, own, possess, and be). For the last verb (be), there are different classes that have different specifications on the verb’s second argument: a noun and an adjective. This results in I am hungry and I am hunger in addition to I (have/possess/own) a hunger. This is where statistical extraction is most valuable: it decides which sequence is more likely.

4.4  Linearization 

In this step, a rule-based linearization grammar is used to create a word lattice that encodes the different possible realizations of the sentence. The grammar is implemented using the linearization engine oxyGen (Habash, 2000) and makes use of the morphological generation component of the generation system Nitrogen (Langkilde and Knight, 1998b). The grammar is based on our previous work in Chinese-English LCS-based MT (Dorr et al., 1998; Traum and Habash, 2000).

4.5  Statistical  Extraction 

The final step, extracting a preferred sentence from the word lattice of possibilities, is done using Nitrogen’s Statistical Extractor without any changes. Sentences are scored using unigram and bigram frequencies calculated from two years of the Wall Street Journal (Langkilde and Knight, 1998c).
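To make the role of statistical extraction concrete, the sketch below ranks overgenerated candidates for tengo hambre with a toy bigram model. It is not Nitrogen's extractor: it scores a flat list of candidate strings rather than a word lattice, uses add-one smoothing as an assumed simplification, and is trained on a five-sentence corpus invented for the example instead of Wall Street Journal text. The names `train`, `score`, and `extract` are hypothetical.

```python
import math
from collections import Counter

# Invented toy corpus standing in for the training text.
TOY_CORPUS = [
    "i am hungry",
    "i am tired",
    "he is hungry",
    "i have a car",
    "she has a hunger for power",
]


def train(corpus):
    """Collect context counts, bigram counts, and the vocabulary size."""
    context, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for left, right in zip(tokens, tokens[1:]):
            context[left] += 1
            bigrams[(left, right)] += 1
    return context, bigrams, len(vocab)


def score(sentence, model):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    context, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(left, right)] + 1) / (context[left] + vocab_size))
        for left, right in zip(tokens, tokens[1:])
    )


def extract(candidates, model):
    """Pick the highest-scoring candidate."""
    return max(candidates, key=lambda s: score(s, model))


MODEL = train(TOY_CORPUS)
CANDIDATES = ["i am hungry", "i am hunger", "i have a hunger"]
```

On this toy data `extract(CANDIDATES, MODEL)` prefers i am hungry, because the bigram am hungry is attested while am hunger is not and the longer have a hunger alternative accumulates more low-probability transitions.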

5  Preliminary  Evaluation 

We conducted the following evaluation to assess the applicability of the approach to handling Spanish-English translation divergences. The data we use for our evaluation are the first 48 verb-unique instances of Spanish-English variations from the El Norte Corpus. Out of the 48 sentences, 39 (81%) were confirmed to be resolvable given our approach, i.e., these divergences could be generated using the simple lexical semantics we employ together with the structural expansion and categorial variations.

On the other hand, 7 cases (14.5%) would require more conceptual knowledge. For example, the expression dar muerte a (to give death to), which translates into kill, cannot currently be generated given that in our lexicon, kill and death are not linked at all. The only verbal categorial variation of death is deaden, and that is not an appropriate translation here. Generating a link between deaden and kill requires another, more conceptual resource such as the Sensus Ontology (Knight and Luk, 1994). Even a simpler lexical database such as WordNet (Fellbaum, 1998) does not have a synset relating these two verbs. Such an expansion is still very much in the spirit of generation-heavy machine translation since all of the new knowledge is represented in the target language.

The remaining 2 cases (4%) out of the 48 sentences require pragmatic knowledge and/or hard-wiring of idiomatic non-decompositional structures. For example, the Spanish ponerse de pie (put-self of/on foot) should translate into to stand up.

6  Future  Work 

Our immediate future work will involve an expansion of the linearization grammar to be able to handle large-scale Spanish-English GHMT. We also plan to explore extensions to the symbolic component of our system, e.g., a conceptual representation that facilitates generation by linking concepts that are not related morphologically. In addition, we plan to explore extensions to the statistical component through the use of structural bigrams. Finally, we are interested in testing our source-language independence claim by retargeting the system to Chinese input.